Final project for the Introduction to Data Science / Text as Data class

By Adelaida Barrera, Natalia Mejía, Mariana Saldarriaga and Isabel de Brigard

Introduction: what we set out to do

This semester we seem to be permanently glued to our screens. From Zoom to R, from Moodle to social media, more and more of our days are spent in front of our phones or computers. But this trend, though exacerbated by the unusual conditions of the last year (ugh, yes, it has almost been a year), did not start with some flying rodent far away. Digital platforms have carved up more and more of our time, and seem to direct more and more of our actions. As policy students, we were interested in one instance where this seems to be happening very notably: the way Twitter communicates, condenses, and shapes public discourse around salient policy issues. (Guess it’s not procrastination if you can call it research…)

But the question “How does Twitter shape public discourse?” seemed a bit ambitious for finals and the four hours of daylight Berlin offers at the end of the year. So we decided to narrow our focus and explore the public discourse around feminism and gender issues revealed by a select (yes, select as in selected by us, more on this in a minute) group of Twitter accounts of activists, political leaders, writers, and all-around opinion shapers from Colombia. We wanted to flex our newly acquired web scraping and text mining muscles, as well as try our hand at some initial network analysis. It was a bumpy-but-fun road that mostly got us excited to keep at it until we are fully fluent in data science (looking at you, regular expressions).

In what follows you will find, first, a brief section on our sample: how we chose the accounts, what a savage beast our data was at the beginning, how we tried to tame it, and what it looked like when we finally decided to take it for a spin. Then we will get down to business and, with the help of some bigrams and topic modeling, ask what these accounts actually talk about. We will then attempt to understand how they relate to or differ from one another, with some scaling and some network analysis. Finally, with the help of sentiment analysis, we will explore how these tweeters feel about a couple of interesting and somewhat controversial topics.

Because we are in academia after all, a few caveats before we begin.
    • We are aware that the results we get are not robust enough to support any life-altering hypothesis. For the most part, what we get is an interesting exploration of the methods we learned, and a whole lot of experience as to how difficult it is to do original, high-quality, data-driven research. Who knew.
    • Will tried to warn us: the assumptions that we need to make to take a text-as-data (TADA) approach are strong and not always easily fulfilled. And Twitter is hardly a place one associates with stability or convention. However, we were also told we could leave our linguistic and psychological aspirations at the door and still have some fun just treating text as data. (Note to self: explain this better.)
    • We are particularly self-conscious about our scaling space and the substantive meaning it might have. Beyond the humble realm of our data, it might not be enough to justify any interpretation of what it could be describing. Nonetheless, within our bounds, we found some interesting stuff. Bear with us, and leave generalizations for after we win that huge research grant.

    Despite all this, we do think we can draw a couple of interesting conclusions from this first glimpse into our local Twitterverse. These are the highlights:
    (Here I would put a couple of bullet points with some of the conclusions that are left at the end, if any are left.)

The sample

How did we choose the Twitter accounts? How did we gather the data?

Our initial intuition was that certain Twitter accounts shape public discourse and that gathering those would give us a balanced and relatively complete picture of what most Twitter talk was about. This is partly the idea behind the Cifras y Conceptos opinion leaders panel, which traces the opinions of various individuals on a wide range of topics. These opinion leaders, they say, “differ from public opinion in general, because they are the ones who guide the climate of opinion, have the capacity for foresight and influence political issues and issues on the national agenda” (source: https://cifrasyconceptos.com/productos-panel-de-opinion/), and so tracing their points of view should tell us about more than their personal stance on a given topic.

So we dived into the Twitterverse to see who came out to greet us. With a combination of research, personal experience, and some calls to people in the Colombian political sphere, we came up with 69 individuals and 39 institutional accounts that we felt had to be included if we were interested in what was being said about feminism and gender issues. This gave us an initial tweet count of upwards of XXX, which seemed like a decent amount of text to begin with.

But yes. We know. This is not a complete, balanced, objective picture of the public discourse on Twitter on these issues. Moreover, there is no way to know from our data how biased or incomplete our sample is. We know. Remember the part about the research grant? Well, the phone still hasn’t rung. But we decided to keep going with what we had. Our thinking was this. Agonizing over how bad our selection was only drove home what our data science professors have been telling us since Stats I: fancy analytical tools only get you so far. If you actually want to be able to say something about the world, you need to work on your theory. Really work on it. But we felt this was an exercise about the tools we had learned. The tools, not the theory. And for that (trying our hand at a limited sample) we had enough. The rest was standard web scraping. Yes, that phrase actually makes sense to us now. We set up our API authorization and scraped away. And what a beautiful mess we got.

What did our data look like?

In our initial exploration of the data, we looked at the average number of tweets per individual account and the oldest tweet for each account. We then plotted the frequency of tweets across time:

This exploration showed that, because some of the accounts posted content much less frequently than others, the last 3,200 tweets of each account covered very different time periods. We thus limited our data to tweets from the last six months and plotted those:

This produced a much more balanced sample, with 67 individual accounts and 116,402 tweets.
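The two steps described above (summarizing each account, then keeping only a six-month window) can be sketched roughly as follows. This is an illustrative Python sketch with made-up account names and dates, not the code we ran; the field names, the `summarize_by_account` and `keep_recent` helpers, and the choice of 183 days as "six months" are all assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical toy records standing in for the scraped tweets.
tweets = [
    {"account": "account_a", "created_at": datetime(2020, 12, 1), "text": "first"},
    {"account": "account_a", "created_at": datetime(2020, 11, 15), "text": "second"},
    {"account": "account_a", "created_at": datetime(2020, 10, 1), "text": "third"},
    {"account": "account_b", "created_at": datetime(2020, 11, 20), "text": "fourth"},
    {"account": "account_b", "created_at": datetime(2019, 3, 5), "text": "fifth"},
    {"account": "account_b", "created_at": datetime(2018, 7, 10), "text": "sixth"},
]


def summarize_by_account(tweets):
    """Return {account: (tweet_count, date_of_oldest_tweet)}."""
    summary = {}
    for tweet in tweets:
        count, oldest = summary.get(tweet["account"], (0, tweet["created_at"]))
        summary[tweet["account"]] = (count + 1, min(oldest, tweet["created_at"]))
    return summary


def keep_recent(tweets, reference_date, days=183):
    """Keep only tweets posted within `days` days before `reference_date`."""
    cutoff = reference_date - timedelta(days=days)
    return [tweet for tweet in tweets if tweet["created_at"] >= cutoff]


summary = summarize_by_account(tweets)
recent = keep_recent(tweets, reference_date=datetime(2020, 12, 15))
```

With equal tweet counts, a frequent poster like the hypothetical `account_a` covers a few months while `account_b` stretches back years, which is the imbalance the date filter is meant to remove.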

From this sample, we then removed 22 accounts belonging to congresswomen. We decided to do this after running a topic model on a random sample of 7,000 tweets (which was already a stretch for our 2012 laptops). Although we knew this would limit what we could say in our analysis later, these accounts had too much content on topics other than gender and feminism, and including them would have made it even harder to get a sense of what public discourse around these issues actually is.
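For readers who want to see the mechanics, drawing a reproducible random sample and then excluding a set of accounts might look something like this. It is a Python sketch on invented data, not our actual code; the helper names, the fixed seed, and the account labels are assumptions made for illustration.

```python
import random

# Hypothetical toy corpus: 100 tweets spread over 5 made-up accounts.
tweets = [{"account": f"account_{i % 5}", "text": f"tweet {i}"} for i in range(100)]


def sample_tweets(tweets, k, seed=42):
    """Draw k tweets at random; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(tweets, min(k, len(tweets)))


def drop_accounts(tweets, excluded):
    """Remove every tweet written by an account in `excluded`."""
    excluded = set(excluded)
    return [tweet for tweet in tweets if tweet["account"] not in excluded]


sample = sample_tweets(tweets, k=10)
filtered = drop_accounts(tweets, excluded={"account_0"})
```

Fixing the seed matters when a topic model is rerun several times: the same subset of tweets goes in each time, so changes in the topics come from the model settings and not from a different draw.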

Finally, we restricted the institutional accounts to match the same period we had chosen for the individual ones, and ended up with 39 accounts and 25,941 tweets. Here, because the institutions we had chosen are explicitly dedicated to the topics we were interested in, there was no need to leave anyone out. Institutions are, well, more institutional…

Cleaning the data

With our data ready and the help of quanteda, we created a corpus. Finally, text was data. And so we did what any text miner would do: we got our rags and buckets out, put our aprons on, and began cleaning.
We removed stop words (both those that ship with the tm package and some we compiled in our own list), punctuation, numbers, and symbols. Then we removed mentions: we were after the what is what, more than the who is who, of Colombian feminist Twitter. (And we would get to connections later on, with the network analysis.) Next were hashtags. Here, again, we understood that this would limit our analysis somewhat, but we felt we had a solid theory-based reason for it. So we had that going for us, which is nice. The reason is that hashtags tend to work globally, as a shortcut to the apparently borderless internet conversation, and we felt including them might disrupt the picture of the more local discourse we were trying to paint. (I don’t think this is well explained.)
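The cleaning steps above can be approximated in a few lines. We did the real work with quanteda and the tm stop-word list in R, so the Python below is only a sketch of the same idea: the tiny `STOPWORDS` set, the regular expressions, and the `clean_tweet` helper are illustrative assumptions, not our pipeline.

```python
import re

# Tiny, hypothetical stop-word list; the real one was far longer.
STOPWORDS = {"de", "la", "el", "que", "y", "en", "a", "los", "las", "por", "para"}

# A mention or hashtag: "@" or "#" followed by letters, digits or underscores.
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")

# A run of letters only (accented letters included; no digits or underscores).
LETTERS = re.compile(r"[^\W\d_]+")


def clean_tweet(text, stopwords=STOPWORDS):
    """Lowercase a tweet and return its tokens without mentions, hashtags,
    punctuation, numbers, symbols or stop words."""
    text = text.lower()
    text = MENTION_OR_HASHTAG.sub(" ", text)
    tokens = LETTERS.findall(text)
    return [token for token in tokens if token not in stopwords]


example = "¡Hoy marchamos por los derechos de las mujeres! @usuario_1 #8M 2020"
tokens = clean_tweet(example)
```

Mentions and hashtags are removed before tokenizing so that the handle or tag text does not survive as an ordinary word once the "@" or "#" is stripped.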

And then we did it all over again for our institutional accounts. By this time, this project was beginning to feel a little like what we figure raising twins must be like: you do a lot of cleaning. And you do it all twice. But we had gotten this far. And we were finally ready to see what all these tweets were about.

What are individual and institutional accounts talking about?

What are the most common expressions?

What topics can we identify in their accounts? Do different profiles (by occupation) talk more about certain topics?

What is each topic about? Word-topic probabilities

Do these topics change over time?

What is the main topic for each account?

Document-topic probabilities

How do they relate to or differ from each other?

How would they be distributed if placed on a single scale? What do the extremes of the scale seem to represent?

Where would the main Twitter opinion leaders fall on this scale?

How are they connected on Twitter?

How do they feel about controversial topics?

How do they feel about the issue of trans women?

How do they feel about the issue of sexual misconduct?

How do they feel about the issue of abortion?

How do they feel about the issue of sex work?

Final remarks